Predicting Red Wine Quality

Phase 1: Data Preparation & Visualisation

Group 25

Galen Ralph Herten-Crabb 3955778


Introduction

Dataset Source

The Wine Quality-Red dataset comes from the University of California, Irvine (UCI) Machine Learning Repository (Cortez et al., 2009). It is one of two datasets created by a team of researchers to determine the characteristics related to wine quality in red and white wines. The two datasets cover the red and white variants of the Portuguese "Vinho Verde" wine.

Dataset Details

Due to privacy and logistic issues, only physicochemical (input) and sensory (output) variables are available, so we don't know which grapes were used or which brands participated. What we do have is the sensory evaluation of many anonymous wines together with their chemical characteristics: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulphur dioxide, total sulphur dioxide, density, pH, sulphates, alcohol and, finally, quality.

In all, the dataset contains 12 variables and 1599 observations.

Dataset Retrieval
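
A minimal sketch of retrieval with pandas, assuming the data is available as a semicolon-delimited CSV file, which is how the repository distributes winequality-red.csv. The local path and the helper name load_wine are assumptions for illustration, and the two rows below only mimic the file layout.

```python
import io
import pandas as pd

def load_wine(source):
    """Read the red wine data from a path or file-like object (semicolon-delimited)."""
    return pd.read_csv(source, sep=";")

# Two illustrative rows in the file's layout, standing in for the full file.
sample = io.StringIO(
    '"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";'
    '"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";'
    '"alcohol";"quality"\n'
    "7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5\n"
    "7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5\n"
)
wine = load_wine(sample)
print(wine.shape)  # (2, 12)
```

With the full file, load_wine("winequality-red.csv").shape should report the 1599 observations and 12 variables described above.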

Dataset Features

The table below describes each feature of the dataset. As many of the variables are scientific in nature, definitions are sourced from the original paper for which this data was produced, referenced below.

Concentrations are given in grams or milligrams per cubic decimetre, and density in grams per cubic centimetre, where indicated.

Target Feature

For this project, the target feature is the subjective sensory grade, the "Quality" variable. Each score (0-10) is the median of at least three grades given by different human wine tasters for each record in the dataset (Cortez et al., 2009), where 0 represents a very poor wine and 10 an excellent one. It is this score that the project hopes to predict using the descriptive physicochemical characteristics that make up the other variables in this dataset.

Goals and Objectives

The goal of this project is to identify the key factors involved in producing consistently high-quality red wine and to use those factors to predict quality before a wine goes to market. Many agricultural and chemical processes are involved in making red wine, and knowing which individual characteristic, or combination of characteristics, is desirable can dramatically improve quality and maximise ROI when planning the growing season.

The international wine market is saturated with thousands of products with disparate offerings, but taste must always be paramount. Pinning down the recipe could therefore save large amounts of money that might otherwise be spent experimenting in directions this model, properly applied, may already have ruled out.

Goals

  1. Predict the quality of red wine
  2. Identify the chemical features and/or combination of features that determine red wine quality.

Data Cleaning and Preprocessing

This dataset has already been cleaned, as it comes from an ML repository and is labelled as such. There are no missing values and no index or ID columns, and all variables are required because they all represent relevant chemical factors in the sensory appreciation of red wine.
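
The absence of missing values can be confirmed directly rather than assumed. A minimal sketch, using a small hypothetical frame in place of the full dataset (missing_report is an illustrative helper, not part of the original analysis):

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Return the number of missing values in each column."""
    return df.isna().sum()

# Hypothetical frame with one deliberately missing pH value.
demo = pd.DataFrame({"pH": [3.51, 3.20, np.nan], "alcohol": [9.4, 9.8, 10.0]})
report = missing_report(demo)
print(report)
```

Applied to the wine data, every count is expected to be zero.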

The dataset is also of a manageable size, so no sampling is required and the entire set may be used.

The one preparatory action taken was to remove spaces from column names to facilitate further processing and visualisation.
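
One way this renaming might look in pandas is sketched below. Replacing each space with an underscore is an assumption about the exact convention used, and tidy_columns is a hypothetical helper name.

```python
import pandas as pd

def tidy_columns(df):
    """Return a copy whose column names have spaces replaced by underscores."""
    return df.rename(columns=lambda name: name.strip().replace(" ", "_"))

# Empty frame carrying three of the original column names.
demo = pd.DataFrame(columns=["fixed acidity", "citric acid", "pH"])
tidy = tidy_columns(demo)
print(list(tidy.columns))  # ['fixed_acidity', 'citric_acid', 'pH']
```

Names without spaces allow attribute-style access such as tidy.fixed_acidity and are easier to use in plotting calls and formula strings.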

Data Exploration & Visualisation

Univariate Visualisation

Histogram of Red Wine Quality

This first visualisation shows the distribution of wine quality over all records. It is important to understand whether the data is skewed and whether transformation will be necessary later on. The plot reveals that the data skews somewhat to the right and may require transformation in Phase 2.

Although quality is scored from 0 to 10, the dataset maximum is 8, as each score combines several separate taste scores.
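
The visual impression of skew can be backed by a number. The sketch below counts how often each score occurs and computes the sample skewness with pandas; the scores are hypothetical and chosen only to illustrate a longer right tail.

```python
import pandas as pd

# Hypothetical quality scores, not the real data.
quality = pd.Series([5, 5, 5, 5, 5, 6, 6, 6, 7, 8], name="quality")

counts = quality.value_counts().sort_index()  # frequency of each score
skewness = quality.skew()                     # positive when the right tail is longer

print(counts)
print(f"skewness = {skewness:.2f}")
```

A skewness close to zero would suggest that no transformation is needed, while a clearly positive value supports the reading of the histogram above.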

Box plot of alcohol content

Alcohol is a key component of red wine, but it does not necessarily add to its quality; in fact, too strong an alcohol content can often negatively affect the sensory experience. Alcohol content is also relevant in other areas of viniculture and may reflect the use of certain techniques.

This box plot checks that the data does not contain a wide spread of alcohol content, since alcohol content can be a classification on its own.

Alcohol content has several outliers that may need to be addressed. Otherwise there are no real surprises, as alcohol content sits within a narrow range.
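
Box plots conventionally flag points lying more than 1.5 times the interquartile range beyond the quartiles. A minimal sketch of that rule, with hypothetical alcohol values and an illustrative helper name iqr_outliers:

```python
import pandas as pd

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[mask]

# Hypothetical alcohol percentages with one unusually strong wine.
alcohol = pd.Series([9.4, 9.5, 9.8, 10.0, 10.2, 10.5, 11.0, 14.9])
outliers = iqr_outliers(alcohol)
print(outliers.tolist())  # [14.9]
```

Listing the flagged records in this way makes it easier to decide whether they are measurement errors or simply strong wines that should be kept.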

Two-Variable Visualisation

Boxplot of Alcohol and Quality

As mentioned above, alcohol content is a consideration in assessing the quality of red wine and can be indicative of certain agricultural practices and fermenting techniques. This boxplot investigates whether there is any significant relationship between the two variables, for example whether alcohol content varies significantly at each level of quality and whether alcohol content and quality are correlated.

The plot suggests a correlation between alcohol content and quality, but the large number of outliers at the middle quality levels suggests that either more cleaning is required or alcohol is only a co-factor in quality.
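
The comparison drawn in the boxplot can also be summarised numerically by grouping alcohol content by quality level. The sketch below uses hypothetical values and assumes the columns are named quality and alcohol.

```python
import pandas as pd

# Hypothetical records, not the real data.
demo = pd.DataFrame({
    "quality": [5, 5, 5, 6, 6, 6, 7, 7],
    "alcohol": [9.4, 9.6, 9.8, 10.2, 10.5, 10.8, 11.5, 12.0],
})

# Median alcohol content at each quality level (the centre line of each box).
by_quality = demo.groupby("quality")["alcohol"].median()
print(by_quality)
```

Medians that rise steadily with quality would be consistent with the trend seen in the plot.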

Scatterplot of pH and Alcohol Content

This plot attempts to detect any correlation between alcohol content and another critical element in red wine production, pH level. pH influences microbiological stability, affects the equilibrium of tartrate salts, determines the effectiveness of sulphur dioxide and enzyme additions, influences the solubility of proteins and the effectiveness of bentonite, and affects red wine colour and oxidative and browning reactions (Boulton et al., 2013).

pH is a fundamental property of red wine and must fall within a certain 'sweet spot'. There will likely be a relationship between pH level and quality, but this plot seeks to reveal the relationship between pH and alcohol.

We see below that there is no strong correlation between the two variables.
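
The strength of a linear relationship such as this one can be quantified with the Pearson correlation coefficient, which pandas computes by default in Series.corr. The values below are hypothetical and were constructed so that the coefficient is zero.

```python
import pandas as pd

# Hypothetical measurements, constructed to be uncorrelated.
demo = pd.DataFrame({
    "pH": [3.0, 3.2, 3.4, 3.2],
    "alcohol": [10.0, 9.0, 10.0, 11.0],
})

# Pearson's r lies between -1 and 1; values near 0 indicate no linear relationship.
r = demo["pH"].corr(demo["alcohol"])
print(f"r = {r:.3f}")
```

A coefficient near zero only rules out a linear relationship, so the scatterplot remains useful for spotting any curved pattern.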

Three-Variable Visualisation

Pairplot of Dataset

This plot seeks to learn whether there are any strong correlations between the descriptive and target features, and between the descriptive features themselves.

Several strong relationships, both positive and negative, can be seen between the different acids and pH level, which is to be expected. There is also a clear negative relationship between quality and volatile acidity, and there are some interesting, but possibly irrelevant, relationships between density and other descriptive variables such as pH.
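
A correlation matrix is a compact companion to a pair plot. The sketch below lists every pair of columns once, ordered by the absolute value of the Pearson correlation; the data is hypothetical and ranked_correlations is an illustrative helper name.

```python
import numpy as np
import pandas as pd

def ranked_correlations(df):
    """List each pair of columns once, strongest absolute correlation first."""
    corr = df.corr()
    # Keep the strict upper triangle so each pair appears once, without the diagonal.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)

# Hypothetical measurements, not the real data.
demo = pd.DataFrame({
    "fixed_acidity": [7.0, 8.0, 9.0, 10.0],
    "pH": [3.6, 3.4, 3.3, 3.1],
    "alcohol": [9.5, 11.0, 9.0, 10.5],
})
ranked = ranked_correlations(demo)
print(ranked)
```

In this toy data the acidity and pH pair ranks first with a strongly negative coefficient, which mirrors the kind of relationship described above.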

Scatterplot of Quality broken down by Alcohol and Citric Acid

The pair plot above is informative. Interestingly, there appear to be very few positive correlations relating directly to quality other than those with alcohol content, density and citric acid. The plot below explores the relationship with alcohol and citric acid.

The plot suggests a positive trend for the two descriptive variables towards higher-quality red wine, as the data points grow darker towards the upper right corner.
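
A plot of this kind can be sketched with matplotlib by mapping quality to colour on a sequential colour map, so that darker points mean higher scores. The data is hypothetical, and the column names and the "Reds" colour map are assumptions rather than the settings used for the original figure.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical records, not the real data.
demo = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0, 12.8],
    "citric_acid": [0.02, 0.10, 0.25, 0.30, 0.45, 0.50],
    "quality": [4, 5, 5, 6, 7, 8],
})

fig, ax = plt.subplots()
# Colour encodes the third variable: darker red means a higher quality score.
points = ax.scatter(demo["alcohol"], demo["citric_acid"], c=demo["quality"], cmap="Reds")
fig.colorbar(points, ax=ax, label="quality")
ax.set_xlabel("alcohol")
ax.set_ylabel("citric_acid")
```

Calling fig.savefig with a file name, or plt.show() under an interactive backend, then renders the figure.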

Summary & Conclusions

Being able to predict the quality of a red wine before going through the rigours of bottling, marketing and distribution could save a business a tremendous amount of time and money. The goal of this project is to identify which chemical factors relate to a high-scoring wine and, ultimately, to predict that quality straight out of the barrel through chemical analysis and this model.

For the first stage of this project the data was read and the headings were modified to facilitate further visualisation. Because the data comes from an ML repository, there were no missing values and no need to drop any variables, as they all represent key chemical elements in wine analysis.

Exploring the data through several visualisation methods revealed that the bulk of the wines tested sit at a score of 5 or 6, but that there was some skewing to the right that may require further attention. Alcohol content ranges between 8% and 15%, with most wines falling between 9.5% and 11.1%. Further exploration of alcohol's relationship to quality showed that higher-quality wines had a higher alcohol content, suggesting a correlation between the two. During research, pH level was found to be another significant factor influencing wine quality, but no correlation was found when it was plotted against alcohol. These two variables seem important but appear to be independent of each other.

A pair plot allowed each variable to be compared against every other to investigate any unexpected correlations. Many of those found were to be expected, for example a strong negative correlation between the several acids present and pH level, and a negative trend between quality and volatile acidity. A relationship between citric acid and quality was also found. When the quality variable was broken down by alcohol content and citric acid content, it revealed a clear relationship between low levels of these variables and low quality scores, which suggests that these two components are important to the quality of the wine and therefore to the purpose of this project.

References

Boulton, R. B., Singleton, V. L., Bisson, L. F., & Kunkee, R. E. (2013). Principles and practices of winemaking. Springer Science & Business Media.

Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4), 547-553.